
Mass-scale computing: Why Hadoop is hot but Java is not | VentureBeat
First, Java hogs extra resources (cache, memory, CPU cycles), a fact that doesn't always show up well in single benchmark tests, but does show up clearly when multiple Java benchmarks compete for resources at the same time. This alone loses Java 15%, as tests I ran showed.
On top of that, on average, Java loses to C++ by about 15%, especially when apps can be compiled for mass-scale computing (using profile-guided compilation, etc.). So you're down 30% to start with by implementing a Java Hadoop app on top of the Java-based Hadoop infrastructure. Now, you don't have to write your app in Java; you can use C++ or even scripting languages. See, for example, why the Hypertable project chose C++. But unfortunately, the choice of Java for infrastructure and the bevy of available libraries is driving many people to use Java for Hadoop. Let's look at what that means financially as we scale out increasingly.
concurrency development scalability tech web
Microsoft Offers Two Database Previews: SQL Server & SQL Azure - ReadWriteEnterprise
SQL Azure is a relational cloud database along the lines of Amazon’s SimpleDB (when it comes to the business model). In other words, it’s keyed towards providing pay-as-you-go scalability at a minimal infrastructure cost. This trial is a key development for the Azure platform. The free trial lasts until November, after which it’ll cost $9.99/month for 1GB, or $99.99 for 10GB.
cloud database tech web
a good post comparing Cascading and Pig.
Goodbye MapReduce, Hello Cascading | Engineering Rapleaf
The most recognizable competing product to Cascading is Pig, a Yahoo technology we also explored. Pig lets you specify batch queries in a neat SQL-like syntax, but we found Pig unusable due to the inability to plug in custom input and output formats. One of the nicest things about Cascading is that it doesn't restrict you in any way - anything you can do via vanilla MapReduce you can do via Cascading. We like the fact that Cascading flows are all specified via a Java API rather than a SQL-like language - this makes it very natural to create custom functions and very complex workflows. And if some part of your workflow is really performance-critical, Cascading gives you the flexibility to hand-code that part of the workflow with a MapReduce job and plug it in as a custom Flow.
cloud concurrency scalability tech
a good read.
Go Ahead: Next Generation Java Programming Style | Code Monkeyism
No setters. Many Java developers automatically - sometimes with the evil help of an IDE - write setters for all fields in their classes. You should not use setters. Think about each setter you want to write, are they really necessary for your fields? Better create new copies of your objects if values change. And try to write code without getters either. Tell, don't ask tells you more about the concept.
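A minimal sketch of the "new copies instead of setters" idea. The post is about Java; Python is used here only for brevity, and the `Account` class, its fields, and `deposit` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Account:
    """Hypothetical value object: no setters, fields cannot be reassigned."""
    owner: str
    balance: int

    def deposit(self, amount: int) -> "Account":
        # "Tell, don't ask": the caller tells the object what to do and
        # gets a modified copy back; the original instance is untouched.
        return replace(self, balance=self.balance + amount)


before = Account(owner="alice", balance=100)
after = before.deposit(50)
```

Here `before` keeps its balance of 100 while `after` carries 150, and an attempt such as `before.balance = 0` raises `FrozenInstanceError`.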
development tech
must take YQL for a spin.
All the Web’s a Database: Yahoo Extends YQL With Insert, Update, Delete
While the earlier incarnations of YQL were mainly meant to read data, with the addition of these three new SQL verbs, the focus has now shifted towards writing data back to the net as well. Developers can now use YQL to write and modify data on web services and applications.
To explain how useful this can be, the Yahoo team used a few different examples. A developer can now easily use YQL to update a Twitter account (even authentication with OAuth is possible), for example, or add a new comment to a blog post, or insert any data into a remote database. Basically, developers can now use YQL to write data back to any web site that uses forms for data entry and to any API, including authenticated APIs.
development tech web yahoo
might look into this later.
Daemon - Daemon : Procrun
Procrun is a set of libraries and applications that make it much easier to run Java applications on WIN32. Prunsrv, the Procrun service application, runs applications as services; it can convert any application to run as a service. Prunmgr, the Procrun monitor application, is a GUI application for monitoring and configuring Procrun services.
development tech
Edsger Dijkstra and his well-known opposition to the break construction. (via @debasishg)
The Universe of Discourse : Hoare logic
Now consider a more complex block, one of the form:
    if (q) { E; }
    else { F; }
Suppose you believe that code E, given precondition x, is guaranteed to produce postcondition y. And suppose you believe the same thing about F. Then you can conclude the same thing about the entire if-else block: if x was true before it began executing, then y will be true when it is done.[2] So you can build up proofs (or beliefs) about small bits of code into proofs (or beliefs) about larger ones. We can understand while loops similarly. Suppose we know that condition p is true prior to the commencement of some loop, and that if p is true before G executes, then p will also be true when G finishes. Then what can we say about this loop?
    while (q) { G; }
We can conclude that if p was true before the loop began, then p will still be true, and q will be false, when the loop ends. BUT BUT BUT BUT if your language has break, then that guarantee goes out the window and you can conclude nothing. Or at the very least your conclusions will become much more difficult. You can no longer treat G atomically; you have to understand its contents in detail. So this is where Dijkstra is coming from: features like break[3] tend to sabotage the benefits of structured programming, and prevent the programmer from understanding the program as a composition of independent units.
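A small sketch of the loop argument, under made-up assumptions: here q is "the queue is non-empty" and the invariant p is "processed + len(queue) equals the starting length". The function names `drain` and `drain_with_break` are hypothetical, and both functions empty the list they are given.

```python
def drain(queue):
    """Plain while loop: on exit, p holds and q is false."""
    total = len(queue)
    processed = 0
    while queue:                 # q: queue is non-empty
        queue.pop()              # G: preserves p
        processed += 1
    # The guarantee from the excerpt: p still true, q now false.
    assert processed + len(queue) == total
    assert not queue
    return processed


def drain_with_break(queue, stop_at):
    """Same loop with a break: q may still be true on exit."""
    processed = 0
    while queue:
        item = queue.pop()
        processed += 1
        if item == stop_at:
            break                # leaves the loop without q becoming false
    # Second value reports whether q still holds after the loop.
    return processed, bool(queue)
```

With the break, a caller can no longer assume the queue is empty afterwards without reading the loop body, which is the cost the excerpt describes.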
development tech
checking out the source code right now.
Ruby Language Center - MongoDB - 10gen Confluence
mongo-activerecord-ruby
MongoRecord is an ActiveRecord pattern implementation for Mongo DB. The interface forgoes compatibility for RoR ActiveRecord and instead provides a clean interface with no reference to SQL or relational underpinnings.
database scalability tech
These are great posts on Hive.
Perspectives - Hadoop Summit Notes #5 (final): HBase, Rapleaf, Hive, Autodesk, Computing in the Cloud, & Future Direction Panel
Data Warehousing using Hadoop
- Hive is the Facebook datawarehouse
- Query language brings together SQL and streaming
- Developers love direct access to map/reduce and streaming
- Analysts love SQL
- Hive QL (parser, planner, and execution engine)
- Uses the Thrift API
- Hive CLI implemented in Python
- Query operators in initial versions
- Projections, equijoins, cogroups, groupby, & sampling
- Supports views as well
- Supports 40 users (about 25% of engineering team)
- 200GB of compressed data per day
- 3,514 jobs run over the last 7 days
- 5 engineers on the project
- Q: Why not use Pig? A: Wanted to support SQL and Python.
Hive + Hadoop + S3 + EC2 = It works! - Joydeep Sen Sarma's blog
The use case I had in mind was something like this:
- A user is storing files containing structured data in S3 - Amazon's store for bulk data. A very realistic use case could be a web-admin archiving Apache logs into S3 - or even transaction logs from some db
- Now this user wants to run sql queries on these files - perhaps to do some historical analysis
- Amazon provides compute resources on demand via EC2 - ideally these sql queries should use some allocated number of machines in EC2 to perform the required computations
- The results of sql queries that are interesting for later use should be easily stored back in S3
cloud database scalability tech web
an alternative for Happy?
must look into the details.
Skills Matter : Hadoop User Group UK:Dumbo: Hadoop streaming
Dumbo: Hadoop streaming made elegant and easy
At Last.fm, the number of “write once, run never again” Hadoop programs has been growing steadily, especially in the research team. Since Java is a very verbose and compiled programming language, it is not very suitable for writing such programs. A better way to quickly write MapReduce programs is provided by Hadoop Streaming, but it still is less convenient than it could be. Dumbo is a simple enhancement to Hadoop Streaming that addresses this issue. More specifically, it is a Python module that makes Hadoop Streaming elegant and easy.
Hadoop User Group - UK - Rich Marr's Tech Blog
Dumbo - Klaas Bosteels. Klaas has implemented Dumbo, a system that allows you to write disposable Hadoop streaming programs in Python. The aim was to reduce the amount of work involved in writing one-off jobs. This seems like it could become part of the Hadoop toolset as it's certain to be useful to a lot of people.
Last.fm - the Blog - Python + Hadoop = Flying Circus Elephant
The approach described here is the most convenient way of writing Hadoop programs in Python that I could find on the web, but it still wasn't pythonic enough for my taste. The mapper and the reducer shouldn't have to reside in separate files, and having to write boilerplate code should be avoided as much as possible. To get rid of these issues, I wrote a simple Python module called Dumbo.
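A rough sketch of what "mapper and reducer in one file" can look like, using a word count as the example. This is not Dumbo itself: `run_local` is a hypothetical stand-in that imitates the map, sort, and reduce steps in memory so the file runs without a cluster, and the generator-style signatures are an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # key is ignored here; value is one line of input text.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # values iterates over every count emitted for this word.
    yield key, sum(values)


def run_local(lines):
    """Hypothetical local driver: map, sort by key, then reduce per key."""
    mapped = [pair
              for offset, line in enumerate(lines)
              for pair in mapper(offset, line)]
    mapped.sort(key=itemgetter(0))   # stands in for the shuffle/sort phase
    results = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results
```

Calling `run_local(["a b a", "b c"])` yields one `(word, count)` pair per distinct word, sorted by word.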
cloud scalability search tech web